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Abstract 

We propose an approach to lossy source coding, utilizing ideas from Gibbs sampling, simulated annealing, 
and Markov Chain Monte Carlo (MCMC). The idea is to sample a reconstruction sequence from a Boltzmann 
K.^ , distribution associated with an energy function that incorporates the distortion between the source and reconstruction, 

I the compressibility of the reconstruction, and the point sought on the rate-distortion curve. To sample from this 

distribution, we use a 'heat bath algorithm': Starting from an initial candidate reconstruction (say the original source 
' sequence), at every iteration, an index i is chosen and the i"^ sequence component is replaced by drawing from 

the conditional probability distribution for that component given all the rest. At the end of this process, the encoder 
conveys the reconstruction to the decoder using universal lossless compression. 
^ , The complexity of each .terat.on .s mdependent of the sequence length and only hnearly dependent on a certam 

O . context parameter (which grows sub-logarithmically with the sequence length). We show that the proposed algorithms 

achieve optimum rate-distortion performance in the limits of large number of iterations, and sequence length, when 
employed on any stationary ergodic source. Experimentation shows promising initial results. 
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\^ ' Employing our lossy compressors on noisy data, with appropriately chosen distortion measure and level, followed 

by a simple de-randomization operation, results in a family of denoisers that compares favorably (both theoretically 
and in practice) with other MCMC-based schemes, and with the Discrete Universal Denoiser (DUDE). 
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, I. INTRODUCTION 

Consider the basic setup of lossy coding of a stationary ergodic source X = {Xi : i > 1}. Each source output 
block of length n, X"-, is mapped to an index fni^X"^) of nR bits, where R can be either constant (fixed-rate coding) 
or depend on the block that is coded (variable -rate coding). The index /„(X") is then losslessly transmitted to the 
decoder, and is decoded to a reconstruction block X" = g„(/„(X")). Two main performance measures for a lossy 
coding scheme C = {fn,gn,n) are the following: i) distortion D defined as average expected distortion between 
source and reconstruction blocks, i.e., 

n 

i^^Ed„(X",X") ^ -V Ed(X„X,), (1) 
n ^-^ 

where d : X y. X ^ R+ is a single-letter distortion measure, and ii) rate R defined as the average expected number 
of bits per source symbol, i.e., E[i?]. For any I? > 0, and stationary process X the minimum achievable rate (cf. |Q]| 
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for exact definition of achievability) is characterized as fj], S, H 

R{D,X)^ \iiR min -I{X";X"). (2) 

"-^°°p(X"|X"):Ed„(X",X")<D " 

For the case of lossless compression, we know that the minimum required rate is the entropy rate of the source, 
i.e. -ff(X) = lim H{Xo\XZl), and there are known implementable universal schemes, such as Lempel-Ziv coding 
lH] and arithmetic coding ||6], that are able to describe any stationary ergodic source at rates as close as desired 
to the entropy rate of the source without any error In contrast to the situation of lossless compression, neither the 
explicit solution of (|2]i is known for a general source (not even for a first-order Markov source Q), nor are there 
known practical schemes that universally achieve the rate-distortion curve. 

One possible intuitive explanation for this sharp dichotomy is as follows. The essence of universal lossless 
compression algorithms is learning the source distribution, and the difference between various coding algorithms 
is in different efficient methods through which they accomplish this goal. Universal lossy compression, on the 
other hand, intrinsically consists of two components; quantization and lossless compression. This breakdown can 
be explained more clearly by the following characterization of the rate-distortion function ||8]: 

R{D,X.) =mi{H{Z) : Ed{Xi, Zi) < D}, (3) 

where the infimum in over jointly stationary ergodic processes with X. This alternative representation suggests 
that for coding a process X one should quantize it, either implicitly or explicitly, to another process Z, which is 
sufficiently close to it but more compressible, and then compress process Z via a universal lossless compression 
algorithm. The quantization step in fact involves a search over the space of all jointly stationary ergodic processes, 
and explains to some extent the reason why universal lossy compression is more intricate than universal lossless 
compression. 

In this paper, we present a new approach to implementable lossy source coding, which borrows two well-known 
tools from statistical physics and computer science, namely Markov Chain Monte Carlo (MCMC) methods, and 
simulated annealing ifTOl . MCMC methods refer to a class of algorithms that are designed to generate samples of 
a given distribution through generating a Markov chain having the desired distribution as its stationary distribution. 
MCMC methods include a large number of algorithms; For our application, we use Gibbs sampler ifTTI also known 
as the heat bath algorithm, which is well-suited to the case where the desired distribution is hard to compute, but 
the conditional distributions of each variable given the rest are easy to work out. 

The second required tool is simulated annealing which is a well-known method in discrete optimization problems. 
Its goal is to find the the minimizing state s^in of a function /(s) over a set of possibly huge number of states 

S, i.e., Smin = argmin /(s). In order to do simulated annealing, a sequence of probability distributions pi,p2, • • ■ 

ses 

corresponding to the temperatures 7i > T2 > . . ., where as z — > 00, and a sequence of positive integers 

A^i, N2, ■ . ., are considered. For the first A^i steps, the algorithm runs one of the relevant MCMC methods in an 
attempt to sample from distribution pi . Then, for the next N2 steps, the algorithm, using the output of the previous 
part as the initial point, aims to sample from p2, and so on. The probability distributions are designed such that: 1) 
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their output, with high probability, is the minimizing state s^in, or one of the states close to it, 2) the probability 
of getting the minimizing state increases as the temperature drops. The probability distribution that satisfies these 
characteristics, and is almost always used, is the Boltzman distribution p^(s) oc e~^^^'^\ where /3 oc y. It can 
be proved that using Boltzman distribution, if the temperature drops slowly enough, the probability of ultimately 
getting the minimizing state as the output of the algorithm approaches one lITTji . Simulated annealing has been 
suggested before in the context of lossy compression, either as a way for approximating the rate distortion function 
(i.e., the optimization problem involving minimization of the mutual information) or as a method for designing 
the codebook in vector quantization |[T2l . |[T3l . as an alternative to the conventional generalized Lloyd algorithm 
(GLA) lfT4l . In contrast, in this paper we use the simulated annealing approach to obtain a particular reconstruction 
sequence, rather than a whole codebook. 

Let us briefly describe how the new algorithm codes a source sequence x". First, to each reconstruction block 
y", it assigns an energy, ^(y"), which is a linear combination of its conditional empirical entropy, to be defined 
formally in the next section, and its distance from the source sequence x". Then, it assumes a Boltzman probabiUty 
distribution over the reconstruction blocks as p(y") oc e^^^'^^ \ for some /3 > 0, and tries to generate i" from this 
distribution using Gibbs sampling ifTDl . As we will show, for /3 large enough, with high probability the reconstruction 
block of our algorithm would satisfy £(a;") w min £(?/"). The encoder will output LZ(x"), which is the Lempel-Ziv 
Is) description of x". The decoder, upon receiving LZ{x"), reconstructs i" perfectly. 

In this paper, instead of working at a fixed rate or at a fixed distortion, we are fixing the slope. A fixed slope 
rate-distortion scheme, for a fixed slope s = —a < 0, looks for the coding scheme that minimizes R + aD, where 
as usual R and D denote the rate and the average expected distortion respectively. In comparison to a given coding 
scheme of rate R and expected distortion D, for any Q < S < R — R{D, X), there exists a code which works at rate 
X) + S and has the same average expected distortion, and consequently a lower cost. Therefore, it follows 
that any point that is optimal in the fixed-slope setup corresponds to a point on the rate-distortion curve. 

A. Prior work 

The literature on universal lossy compression can be divided into two main categories: existence proofs and 
algorithm designs. The early works in this area were more about proving the existence of a family of codes 
{n, fm Qn) that achieves the optimal performance, R{D, X), asymptotically for any stationary ergodic process ifTSll . 
lfT6l . IITtI . lITSl . IIT9I . II20I . After the existence of the so-called universal codes were shown, the next step was 
finding such algorithms. We will here briefly review some of the work on the latter. This section is not meant to 
be a thorough review of the literature on universal lossy compression algorithms, but just a brief overview of some 
of the more famous results to the knowledge of the authors. 

One popular trend in finding universal lossy compression algorithms has been extending universal lossless 
compression algorithms to the lossy case. As an example of such attempts is the work by Cheung and Wei II2TI who 
extended the move-to-front transform 1221 . There has also been a lot of attempt on extending the string-matching 
ideas used in the well-known Lempel-Ziv coding to the lossy case: Morita and Kobayashi ll23l proposed a lossy 
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version of LZW algorithm and Steinberg and Gutman ll24l suggested a fixed-database lossy compression algorithms 
based on string-matching. These algorithms have the same spirit of LZ coding, and similar to the LZ code are easy 
to implement. However, all these extensions, as were later shown by Yang and Kieffer ESl . are suboptimal even for 
memoryless sources. Another suboptimal but practical universal lossy compression algorithm based on approximate 
pattern matching is the work of Luczak and Szpankowski 1261 . 

Zhang and Wei 1271 proposed an online universal lossy data compression algorithm, called 'gold-washing', which 
involves continuous codebook refinement. The algorithm is called online meaning that the codebook in constructed 
simultaneously by the encoder and the decoder as the source symbols arrive, and no codebook is shared between the 
two before the coding starts. Most of the previously mentioned algorithms fall into the class of online algorithms 
as well. 

More recently, a new lossy version of LZ algorithm has been proposed by Kontoyiannis l28l which instead of 
using a fixed database which has the same distribution as the source, employs multiple databases. The encoder is 
allowed to choose one of the databases at each step. These multiple databases essentially let the encoder tune the 
reconstruction distribution gradually to the optimal distribution that corresponds to the source distribution. It is a 
fixed-distortion code and is shown to be optimal, at least, for memoryless sources. 

There are also universal lossy compression algorithms that are interesting from a theoretical point-of-view, but 
infeasible to be implemented because of their huge computational complexity. One can refer to the works by 
Ornstein and Shields l29l . Yang and Kieffer l30l . and more recently Neuhuff and Shields l3TI for examples of 
such results. 

As mentioned earlier in this paper the encoder, instead of fixing rate or distortion, fixes the slope. The idea 
of fixed-slope universal lossy compression was first proposed by Yang, Zhang and Berger in 11321 . In their paper, 
they first propose an exhaustive search coding algorithm which is very similar to the algorithm proposed propose 
in Section |III] After establishing its universality for lossy compression of stationary ergodic sources, they suggest 
some heuristic approach for finding an approximation to its solution. In our case, the special structure of our cost 
function enables us to employ simulated annealing plus Gibbs sampling to approximate its minimizer 

For the non-universal setting, specifically the case of lossy compression of an i.i.d. source with a known 
distribution, there is an ongoing progress towards designing codes that get very close to the optimal performance 

Ea, m, Ea, Eg. 

B. Paper organization 

The organization of the paper is as follows. In Section HIl we set up the notation. Section |lll] describes an 
exhaustive search scheme for fixed-slope lossy compression which universally achieves the rate-distortion curve for 
any stationary ergodic source. Section |IV] describes our new universal MCMC-based lossy coder, and Section [V] 
presents another version of the algorithm for finding sliding-block codes which again universally attain the rate- 
distortion bound. Section |VI] gives some simulations results. Section IVIll describes the application of the algortihm 
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introduced in Section |IV] to universal compression-based denoising. Finally, Section FVIIII concludes the paper with 
a discussion of some future directions. 

II. Notation 

Let X = {Xi; V i E 1N+} be a stochastic process defined on a probability space (X, S, fi), where E denotes the 
(T-algebra generated by cylinder sets C, and /i is a probability measure defined on it. For a process X, let X denote 
the alphabet of Xi, which is assumed to be finite. The shift operator T ; X°° X°° is defined by 

(rx)„ ==a;„+i, xeA'°°,n>l. 

For a stationary process X, let H(X.) denote its entropy rate defined as HpCj = lim H{Xn+i\X^). 

Calligraphic letters, X, y, etc, are always assumed to refer to sets, and usually represent the alphabet sets 
of random variables. The size of a set A is denoted by |^|. Specifically, let X and X denote the source and 
reconstruction alphabets respectively. 

For G y", define the matrix m(y") e M'-^I x rI-^I to be (fc + 1)"' order empirical count of y", i.e., its 
(/3,b)'^ element is defined as 

™0,b(y") = \{k + l<i<n: = b,y, = /3]}| , (4) 

where b e 3^*^, and /? G 3^. Let Hk.{y^) denote the conditional empirical entropy of order k induced by y", i.e., 

Hu{yn=H{Yk+i\Y''), (5) 

where Y^^^ on the right hand side of ^ is distributed according to 

P(r''^+i = [b,/3])=m^,b(y"). (6) 

For a vector v = (wi, . . . ,viY' with non-negative components, we let ■H(v) denote the entropy of the random 
variable whose probability mass function (pmf) is proportional to v. Formally, 

f E ^ log — if V ^ (0, . . . , 0)^ 

[ if v = (0,...,0)^, 

where log(O) = by convention. The conditional empirical entropy in (|5]l can be expressed as a function of 
m(?/") as follows 

= - E ^ (m.,b(2/")) l^m,b(y"), (8) 

b 

where 1 and m..b(j/") denote the all-ones column vector of length |3^|, and the column in m(y") corresponding 
to b respectively. 

For vectors u and v both is R", let ||u — v||i denote the £i distance between u and v, defined as follows 

n 

||u - v||i = E ~ ""^l- (9) 

1=1 

Also the total variation between the two vectors is defined as 

||u - v||tv = ^||u - v||i. (10) 
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III. An exhaustive search scheme for fixed-slope compression 

Consider the following scheme for lossy source coding at a fixed slope a > 0. For each source sequence let 
the reconstruction block i" be 

= argmin [i/fe(y") + ad„(x",2/")] . (11) 
y" 

The encoder, after computing losslessly conveys it to the decoder using LZ compression. 

Theorem 1: Let X be a stationary ergodic source, let i?(_D,X) denote its rate distortion function, and let X" 
denote the reconstruction using the above scheme on X'"-. Then 

-iiziX") +adn(X'',X")''^ inm\R(D.X) +aD], a.s. (12) 

In words, the above scheme universally attains the optimum rate-distortion performance at slope a for any stationary 
ergodic process. The drawback of the described algorithm is its computational complexity; It involves exhaustive 
search among the set of all possible reconstructions. The size of this set is \X\''^ which grows exponentially fast 
with n. 



Remark 1: Although the exhaustive search algorithm described above is very similar to the generic algorithm 
proposed in ll32ll . they are in fact different. The algorithm proposed in 1321 is as follows 



argmin 



-Z(y") + ad„(x",y") 



(13) 



where Z(y") is the length of the binary codeword assigned to y" by some universal lossless compression algorithm. 
From this definition, should satisfy the following two conditions: 

1) For any n E M, 

2-'(^") < 1. 

2) For any stationary ergodic process X, 



lim -1{X") = H{X), a.s. 



(14) 



But conditional empirical entropy, is not a length function {J2 2~"-^'"(?'") > 2-"-f^'=(0- -")+2~"^'=(i'- -i) = 2, 

for any k and n). Hence, the algorithm proposed above is not an special case of the generic algorithm proposed in 



Remark 2: Although as described in Remark [T] Hk{-) is not a length function itself, it has a close connection to 
length functions, specifically to ^lz(-)- This link is described bt Ziv inequality 1371 which states that if /s„ = o(log7i), 
then for any e > 0, there exists E such that for any individual infinite-length sequence y = (j/i, y2, ■ ■ .) and 
any n > N^, 



< e. 



(15) 



As described in Section H] the process of universal lossy compression can be divided into two steps: quantization 
and universal lossless compression. The second step which involves universal lossless compression of the quantized 
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sequence is extensively studied in the literature already and can be done efficiently using existing coders. Hence in 
this paper we focus on the first step, and try to show that it can be done efficiently via simulated annealing. 
Proof of Theorem\J] From part (1) of Theorem 5 in 



lim inf 



-4z(X") + ad(X",X") 



> min[i?(i:),X) a.s. 

D>Q ^ ^ ■ ' 



(16) 



which says that the probability that a sequence of codes asymptotically beats the fundamental rate-distortion limit 
is zero. 

In order to establish the upper bound, we split the cost function into two terms as follows 



1 



-4z(^") + ad(X",X") 



-4z(^") - Hk„iX") + i/fe„(X") + 



i7fc„(X") + arf(X",X") 



(17) 
(18) 



From ifJTl , for fc„ = o(logn) and any given e > 0, there exists iVe G IN such that for any individual infinite-length 
sequence x = {xi,X2, ■ ■ •) and any n > N,^, 



< e. 



(19) 



Consider an arbitrary point {R{D, X), D) on the rate-distortion curve corresponding to source X. Then for any 
6 > there exists a process X such that (X, X) are jointly stationary ergodic, and moreover lUl 

1) H{±) < R{D,X), 

2) Ed{Xo,Xo) < D + S. 

Now since for each source block X", the reconstruction block X" is chosen to minimize Hk{X^) + ad{X^ , X^), 
we have 



For a fixed fc, from the definition of the fc*'* order entropy, we have 

Hk{X^) = ^ E (m,„(X")) , 

where 

1 1 " 

w.p.l. 



1=1 



V[X%=u 



fc+i 



(20) 



(21) 



(22) 



(23) 



Therefore, combining ^2\\ and (|23T l, as n goes to infinity, Ht^{X"-) converges to H{Xq\Xz}^) with probability one. 
It follows from the monotonicity of iffc(a;") in fc, ( |20l i, and the convergence we just established that for any x" 
and any k. 



i7fe„(X") +ad(X",X") < H{Xo\XZl) + e + ad{X'',X''), eventually a.s. 



(24) 
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On the other hand 



n 

rj — ^ 



Combining ( fT9l l and (l24l) yields 

1 



lim sup 



4z(X") + ad(X",X") 



<i/(Xo|XI;i) + 2e + a(£i + (5) a.s. 



The arbitrariness of k, e and 5 imphes 



lim sup 



-4z(^") + «rf(X",l") 

n 



< R{D,X) +aD, a.s. 



for any D >0. Since the point {R{D,X.),D) was also chosen arbitrarily, it follows that 



lim sup 



Finally, combining (fTST l. and ( |28] | we get the desired result: 

1 



lim 



-4z(X") 



< min[i?(D,X) 

D>0 



min[i?(i:»,X) 

_D>0 



(25) 



(26) 



(27) 



(28) 



(29) 



Remark 3: As mentioned above, since Hk{-) is not itself a length function, there is a difference between the 
algorithm mentioned here, and the one proposed in ||32|. However, as we will argue shortly, one can establish a 
connection between the two, and derive the following result directly from the theorem proved in 



i/fe(X") + X") ™ mill [R{D, X) + aD] , a.s., 



D>0 



(30) 



where again X" is a minimizer of ( fTTT i. 

Consider the following entropy coding scheme for describing a sequence E y". First, divide into 13^1*^ 
subsequences {j/b'^ibei'''- Each subsequence corresponds to a vector b e J^'"", and consists of those symbols in y" 
which are preceded by b. From our definitions. 

Now describing the sequence y" can be done by describing the mentioned subsequences to the decoder separately. 
Note that the decoder can merge the subsequences and form the original sequence ?/" easily if it knows the first k 
symbols as well. For describing the subsequences, we first send the matrix m(j/") to the decoder For doing this 
at most 13^1'^+^ [log rt] bits are required. After having access to the matrix m, for each subsequence, the decoder 
finds its length n^, and also the number of occurrences of each symbol within it. Then, since there only exists 



rib 



(31) 



'^TOai.b, •■• .rirua^^b 

such sequences, the encoder is able to describe the sequence of interest within this set by just sending its index. 
But from Stirling approximation, i.e.. 



'27rn 1 — 1 e" 
. e - 
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where i2n+i — "^'i — T2n' follows that the required number of bits for sending the index can be written as 

= nHixn.^Y,) + nri{k,n), (32) 

nma-,,h, ■■■ -.nmaj^M j 

where ri{k,n) = o(l) and does not depend on b or j/^''. Denoting the overall number of bits required by this 
coding scheme for coding the sequence by le{y^) it follows that the number of bits per symbol is 

|3^|S/(fc,n) + |3^|''^+i(logn+l) 



-^e(2/") = $]-H(m,b) 



= i?fc(y") + C(fc,n), (33) 

where ({k, n) = 1^1 ri(k,n)+\y\ (logw+i) _ ^^-^^ which again does not depend on y". From our construction, 

clearly /e(') is a length function. Moreover, since C{k,n) does not depend on y", 

argmin[^Ze(y") +a(i„(x",y")] = argmin[i?A:(y") + C(fc, + ad„(a;", y")], 



y y 



argmin[i?fe(y") + ad„(.T", y")]. (34) 



Therefore, 



lim [iZe(y") +adn(^",^")] = lim [Hfe(l") + ad„(X", X")] 

n— >-oo 77, 71— >-oo 



= min[i?(i:),X) + , a.s. (35) 

D>0 

IV. Universal lossy coding via MCMC 

In this section, we will show how simulated annealing Gibbs sampling enables us to get close to the performance 
of the impractical exhaustive search coding algorithm described in the previous section. Throughout this section we 
fix the slope a > 0. 

Associate with each reconstruction sequence y" the energy 

£(2/")^n[iffc(2/") + ad„(x",y")] 

n 

= ^ l^ni.,^{y")n{ni.^^{y")) + aY,d{x,,y,). (36) 

uex'' ''=1 

The Boltzmann distribution can now be defined as the pmf on A"" given by 

P/3(y") = ^exp{-/3£(y")}, (37) 

where is the normalization constant (partition function). Note that, though this dependence is suppressed in the 
notation for simplicity, £{y"), and therefore also pp and Zp depend on x" and a, which are fixed until further 
notice. When /3 is large and F" ^ pp, then with high probability 

Hk{Y") + adn{x^,Y^) « mill + ad„(a;", y")] . (38) 
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Thus, for large /3, using a sample from the Boltzmann distribution pfj as the reconstruction sequence, would yield 
performance close to that of an exhaustive search scheme that would use the achiever of the minimum in ( l38T l. 
Unfortunately, it is hard to sample from the Boltzmann distribution directly. We can, however, get approximate 
samples via MCMC, as we describe next. 

As mentioned earlier, the Gibbs sampler ifTDl is useful in cases where one is interested in sampling from a 
probability distribution which is hard to compute, but the conditional distribution of each variable given the rest 
of the variables is accessible. In our case, the conditional probability under pf; of Yi given the other variables 
Y^V A : n ^ i} can be expressed as 



p,(Y- = alF-V = ynV) = pM^. = «,^"\'=y"\') 

E cxp{-/3£(2/-iH'Vi)}' 



E exp{-/3n [iJfc(2/*-i6y,'Vi) + a(i„(x", y^-i&yf+i)] } ' 

hex 

1 

E exp{-/3 [nAHkiy^-^by^^,,a) + a^d{h, a,Xi)] } ' 

hex 



(39) 



(40) 



(41) 



(42) 



where AHk{y^ ^byf_^i,a) and Ad{y^ ^byf_^i,a,Xi) are defined as 

AHk{f~%:^„a) ^ Hkif~%^^,) - i/fc(y-iayIVi), (43) 

and 

Ad{b, a, Xi) = d{b, Xi) — d{a, Xi), 

respectively. 

Evidently, pf}(Y, = = y"^^ depends on y" only through {Hk{y'~'^by^_^j^) - Hkiy'-~'^ay2^^)}^ f^^;^ and 

{d{x,,a)}^^^. In turn, {Hkiy'^'^by^^^) - Hk{y'^'^ayf_^i)}a^b depends on y" only through {i[n{y'~^by^_^^)}b. 

Note that, given m(y" ), the number of operations required to obtain m.{y^~^byf^^), for any b G X is linear in k, 
since the number of contexts whose counts are affected by the change of one component of y" is at most 2k + 2. 
To be more specific, letting iSi(y", b) denote the set of contexts whose counts are affected when the i*^ component 
of ?/" is flipped from yi to b, we have 6)| < 2k + 2. Further, since 

n[H,{y^-%-^,) - Hk{y'-'ay-+,)] = 

ue5i(i/'-ibi/i+i,a) 

[l^m.,„(2/^-i62/r+i)H (m.u(y'-i6yr+i)) - l^m.,„(y-iaj/r+i)^ (m.^^if-'ay^^,))] , (44) 

it follows that, given m(y*^^6?/JYi) ^"d Hk{y''~^by2^i), the number of operations required to compute m(y'^^a?;|Yi) 
and Hk{y^~^ayl]^i) is linear in k (and independent of n). 
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Now consider the following algorithm (Algorithm 1 below) based on the Gibbs sampling for sampling from pp, 
and let denote its (random) outcome when taking k ^ kn and (3 = {Pt]t to be deterministic sequences 



satisfying /s„ = o(logn) and /3t = — ^ log([^J + 1), for some Tq""* > nA, where 



(45) 



IX = max max 

applied to the source sequence as inputQ By the previous discussion, the computational complexity of the 
algorithm at each iteration is independent of n and linear in k. 
Theorem 2: Let X be a stationary ergodic source. Then 



lim lim 



Proof: The proof is presented in Appendix A. 



= Tain\R{D ,1C) + aD] , a.s. 

D>0 



(46) 



Algorithm 1 Generating the reconstruction sequence 



Input: x", k, a, {Pt}t, r 

Output: a reconstruction sequence a;" 

1: y" <- a-" 

2: for t = 1 to r do 

3: Draw an integer i e {1, . . . , n} uniformly at random 

4: For each b e X compute pptiY^ ~ 5|y"\' = 2;"^') given in ( |42]) 

5: Update J/" by replacing its i"^ component j/i by Z, where 

6: Update m(j/") and Hfe(?/") 
7: end for 

8: ^ y" 



V. Sliding-window rate-distortion coding via MCMC 

The classical approach to lossy source coding is block coding initiated by Shannon In this method, each 
possible source block of length n is mapped into a reconstruction block of the same length. One of the disadvantages 
of this method is that applying a block code to a stationary process converts it into a non-stationary reconstruction 
process. Another approach to the rate-distortion coding problem is sliding-block (SB), a.s. stationary, coding 
introduced by R.M. Gray, D.L. Neuhoff, and D.S. Ornstein in ll38l and also independently by K. Marton in ||39l both 
in 1975. In this method, a fixed SB map of a certain order 2kf + 1 slides over the source sequence and generates 

' Here and throughout it is implicit that the randomness used in the algorithms is independent of the source, and the randomization variables 
used at each drawing are independent of each other. 
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the reconstruction sequence which has lower entropy rate compared to the original process. The advantage of this 
method with respect to the block coding technique is that while the achievable rate-distortion regions of the two 
methods provably coincide, the stationarity of the source is preserved by a SB code BOl . Although SB codes seem 
to be a good alternative to block codes, there has been very little progress in constructing good such codes since 
their introduction in 1975, and to date there is no known practical method for finding practical SB codes. In this 
section we show how our MCMC-based approach can be applied to find good entropy-constrained SB codes of a 
certain order 2k f + 1. 

There are a couple of advantages in using SB codes instead of block codes. One main benefit is getting rid of 
the blocking artifacts resulting from applying the code to non-overlappying adjacent blocks of data. This issue has 
been extensively studied in image compression, and one of the reasons wavelet transform is preferred over more 
traditional image compression schemes like DCT is that it can be implemented as a sliding-window transform, and 
therefore does not introduce blocking artifacts BDl . The other advantage of SB codes is in terms of speed and more 
memory-efficiency. 

Remark 4: There is a slight difference between SB codes proposed in |[38l , and our entropy-constrained SB 
codes. In ||38l . it is assumed that after the encoder converts the source process into the coded process, with no more 
encryption, it can be directly sent to the decoder via a channel that has capacity of R bits per transmission. Then 
the decoder, using another SB code, converts the coded process into the reconstruction process. In our setup on 
the other hand, the encoder directly converts the source process into the reconstruction process, which has lower 
entropy, and then employs a universal lossless coder to describe the coded sequence to the decoder The decoder 
then applies the universal lossless decoder that corresponds to the lossless encoder used at the encoder to retrieve 
the reconstruction sequence. 

A SB code of window length 2kf + 1, is a function f : X'^^f^^ X which is applied to the source process {X„} 
to construct the reconstruction block as follows 



Therefore, for specifying a SB code of window length 2k f + 1, there are Kf values to be determined, and f can 
be represented as a vector = [/o, /i, . . . , ./a'/-i] where fi e A" is the output of function f to the input vector 
b equal to the expansion of i in 2k f + 1 symbols modulo \X\, i.e., » = &il<%'|"'- 

For coding a source output sequence x" by a SB code of order 2k j + 1, among lA"!' ' possible choices, 
similar to the exhaustive search algorithm described in Section |IV] here we look for the one that minimizes the 
energy function assigned to each possible SB code as 



(47) 



The total number of {2k f + l)-tuples taking values in X is 




2fc/ + l 



£U''n = n[Hk{y")+ad^{x^,y^')\, 



(48) 
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where y" = /^^] is defined by y.i ~ f(x*^^p. Like before, we consider a cyclic rotation as Xi = x,;+„, for 

any i e E^. Again, we resort to the simulated annealing Gibbs samphng method in order to find the minimizer of ( |48] |. 
Unlike in ( l37b . instead of the space of possible reconstruction blocks, here we define Boltzmann distribution over the 
space of possible SB codes. Each SB code is represented by a unique vector , and ppif^^ ) oc exp {—f3£{y")), 
where y" — y"[a;",/^^]. The conditional probabilities required at each step of the Gibbs sampler can be written 
as 

^Mi^-^i/"-^')- "^^(r!S^, - (49) 

1? 



1 



1? 



(50) 



Therefore, for computing the conditional probabilities we need to find out by how much changing one entry of f^f 
affects the energy function. Compared to the previous section, finding this difference in this case is more convoluted 
and should be handled with more deliberation. To achieve this goal, we first categorize different positions in a;" 
into different types and construct the s" vector such that the label of Xi, at, is defined to be 

2;„+,|A'|^/+-'". (51) 

In other words, the label of each position is defined to be the symmetric context of length 2fc/ + 1 embracing 
it, i.e., x'i^jfj- Using this definition, applying a SB code f^i to a sequence x" can alternatively be expressed as 
constructing a sequence y" where 

Vi = fa, ■ (52) 

From this representation, changing fi from 6 to -0 while leaving the other elements of f^f unchanged only affects 
the positions of the y" sequence that correspond to the label i in the s" sequence, and we can write the difference 
between energy functions appearing in ( fSOl l as 

n[iJfc(m(y")-iJfc(m(y")]+a ^ , z9) - , 0)), (53) 

where y" and y" represent the results of applying P~^f^^\ and p^ Ofi^i to x" respectively, and as noted before 
the two vectors differ only at the positions {j : aj = i}. Flipping each position in the y" sequence in turn affects at 
most 2(fc + 1) columns of the count matrix m(y"). Here at each pass of the Gibbs sampler a number of positions 
in the y" sequence are flipped simultaneously. Algorithm |2] describes how we can keep track of all these changes 
and update the count matrix. After that in analogy to Algorithm [T] Algorithm [3] runs the Gibbs sampling method 
to find the best SB code of order 2fc/ + 1, and at each iteration it employs Algorithm |2] 

Let fpj^^ denote the output of Algorithm [3] to input vector x" at slope a after r iterations, and annealing process 
/?. i^^j"^ ~ 2^^^/ denotes the length of the vector / representing the SB code. The following theorem proved 
in Appendix B states that Algorithm |3]is asymptotically optimal for any stationary ergodic source; i.e., coding a 
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Algorithm 2 Updating the count matrix of y" = f(x"), when /j changes from 6 to d 
Input: x", kf, k, m(?/"), i, i9, 9 

Output: ni(y") 

1: a" ^ 

2: y" ^ 

3: for j = 1 to 71 do 
4: if Q!j = i then 

5: ^ 6* 

6: end if 

7: end for 

8: ni(y") m(y") 

9: for j = fc/ + 1 to 71 — kf do 



10: 


if Qfj = 1 then 




11: 


af^ ^ 1 




12: 


end if 




13: 


end for 




14: 


for j = fc + 1 to 


77 — fc do 


15: 


if aj = 1 then 




16: 


771 j-1 -i— 
Vj-Vj-k 


771 j-1 
Vj-Vj-k 


17: 


771. -j-l -i— 


771. 

Vl^Vj-k 


18: 


end if 




19: 


end for 





- 1 
+ 1 



Algorithm 3 Universal SB lossy coder based on simulated annealing Gibbs sampler 
Input: x", kf, k, a, (3, r 

Output: f'^i 

1: for t = 1 to r do 

2: Draw an integer i e {1, . . . , Kf} uniformly at random 

3: For each a <E X compute Pfitifi = ^'|/^^^') using Algorithm|2l equations ( ISOl l, and ( |53] ) 

4: Update by replacing its i"^ component fi by 6* drawn from the pmf computed in the previous step 

5: end for 



source sequence by applying the SB code ^ ^, to the source sequence, and then describing the output to the 
decoder using Lempel-Ziv algorithm, asymptotically, as the number of iterations and window length kf grow to 
infinity, achieves the rate-distortion curve. 
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Theorem 3: Given a sequence {kj} such that kj oo, schedule (3^ = log([ ^(„) J + 1) for some 
Tq^"^ > if/ A, where 

A = nmx max \£{r-Hff_^-^)^£{r-^bf^_^{)\, (54) 

I fr+,ex''f-\ 
[ s,e £ X 

and k — o(logn). Then, for any stationary ergodic source X, we have 

= min [R{D, X) + aD] , (55) 

Li>0 
K f 

where X" is the result of applying SB code „ ^ to X". 

Proof: The proof is presented in Appendix B. ■ 
Note that in Algorithm |3] for a fixed kf, the SB code is a vector of length Kf = Hence, the size of 

the search space is jA"!^' which is independent of n. Moreover, the transition probabilities of the SA as defined 
by ( |50l ) depend on the differences of the form presented in ( |53] ), which, for a stationary ergodic source and fixed 
kf, if n is large enough, linearly scales with n. I.e., for a given /*~ , f^^i, "d and 9, 

Mm -[£{r'^f!i{)-£ir-'d.f!l{)]=q a.s., (56) 

n— foo 77, 

where q E [0, 1] is some fixed value depending only on the source distribution. This is an immediate consequence 
of the ergodicity of the source plus the fact that SB coding of a stationary ergodic process results in another process 
which is jointly stationary with the initial process and is also ergodic. On the other hand, similar reasoning proves 
that A defined in ( |54] | scales linearly by tt,. Therefore, overall, combining these two observations, for large values of 
77, and fixed kf, the transition probabilities of the nonhomogeneous MC defined by the SA algorithm incorporated 
in Algorithm [3] are independent of 77. This does not mean that the convergence rate of the algorithm is independent 
of n, because for achieving the rate-distortion function one needs to increase kf and n simultaneously to infinity. 

VI. Simulation RESULTS 

We dedicate this section to the presentation of some initial experimental results obtained by applying the schemes 
presented in the previous sections on simulated and real data. The Sub-section I VI- A| demonstrates the performance 
of Alg. [T]on simulated 1-D and real 2-D data. Some results on the application Alg. |3]on simulated ID data is 
shown in Sub-section IVI-CI 

A. Block coding 

In this sub-section, some of the simulation results obtained from applying Alg. [T] of Section |IV] to real and 
simulated data are presented. The algorithm is easy to apply, as is, to both 1-D and 2-D data . 

As the first example, consider a Bcrn(p) i.i.d source. Fig. [T] compares the optimal rate-distortion tradeoff against 
Alg. [T] performance for p = 0.4 respectively. The algorithm parameters are 77 = 15 x 10'^, k = S), j3t ~ (1/7)'^*^"^ 
where 7 — 0.75, and a = 4 : —0.4 : 2. Each point corresponds to the average performance over = 50 iterations. 



lim lim E 
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Fig. 1. Comparing the Alg. [T] performance with the optimal rate-distortion tradeoff for a Bcrn(p) i.i.d. source, p = 0.4 (n = 15 X 10^, 
k = 9, Pt = (1/7) f'/""!, 7 = 0.75, r = lOn and a = 4 ; -0.4 ; 2). 



At each iteration the algorithm starts from a = 4, and gradually decreases the coefficient by 0.4 at each step. 
Moreover, except for a = A where x" is initialized by x", for each other value of a, the algorithm starts from the 
quantized sequence found for the previous value of a. 

As another example. Fig. |2] compares the performance of Alg. [T| when applied to a binary symmetric Markov 
source (BSMS) with transition probability p = 0.25 against the Shannon lower bound (SLB) which sates that for 
a BSMS 

R{D)>RsLB{D)^h{p)~h{D). (57) 

There is no known explicit characterization of the rate-distortion tradeoff for a BSMS except for a low distortion 
region. It has been proven that for D < Dc, where 



Dr 



v/1 - {P/<1)' 



(58) 



the SLB holds with equality, and for D > Dc, we have strict inequality, i.e. R{D) > i?sLB B2l . In our case Dc = 
0.0286 which is indicated in the figure. For distortions beyond Dc, an upper bound on the rate-distortion function, 
derived based on the results presented in Q, is shown for comparison. The parameters here are: n = 2 x 10^, 
fc = 8, A = (1/7)^*/"!, 7 0.8, r = lOn and a = 5 : -0.5 : 3. 

To illustrate the encoding process. Fig. [3] depicts the evolutions of Hk{x^), dn{x" , x"), and £{x") = Hk{x") + 
a(i„(x", i") during coding iterations. It can be observed that, as time proceeds, while the complexity of the sequence 
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Fig. 2. Comparing the algorithm rate-distortion performance with Shannon lower bound for a BSMS(p) (p = 0.2, n = 2 X 10'*, fc = 8, 
/3t = (l/7)ri/il, -y = 0.8, r = lOn and o = 5 : -0.5 : 3) 



has an overall decreasing trend, as expected, its distance with the original sequence increases. The over cost which 
we are trying to minimize, increases initially, and starts a decreasing trend after a while. Here the source again is 
a Bcrn(p) source, but with p = 0.2. The algorithm parameters are n = 2 x 10"', fc = 9, a = 4, r = lOn, and 
(3t ~ (l/7)r*/"l, where 7 = 0.7. Fig. |4] shows similar curves when the source is binary Markov with transition 
probability p = 0.2. The other parameters are n = 10"*, fc = 7, a = 4, ?- = lOn, and fit = (1/7)'^*^"^ where 
7 = 0.8. 

Finally, consider applying the algorithm to the n x n binary image shown in Fig. |6], where n = 252. Let N = in? 
denote the total number of pixels in the image. Fig. |7(a)| and Fig. |7(b)| show the coded version after r ~ bQN 
iterations for a = 0.1 and a = 3.3 respectively. The algorithm's cooling process is (5t ~ (l/7)'^*/"l with 7 = 0.99. 
Fig|5] shows the 2-D context used for constructing the count matrix of the image that is used by the algorithm. In 
the figure, the solid black square represents the location of the current pixel, and the other marked squares denote 
its 6*^ order causal context that are taken into account. 

Fig. |7(a)| the empirical conditional entropy of the image has decreased from 0.1025 to 0.0600 in the reconstruction 
image, while an average distortion of D ~ 0.0337 per pixel is introduced. Comparing the required space for storing 
the original image as a PNG file with the amount required for the coded image reveals that in fact the algorithm 
not only has reduced the conditional empirical entropy of the image by 41.5%, but also has cut the size of the file 
by around 39%. Fig. [8] shows the size of the compressed image in terms of the size of the original image when a 
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Fig. 3. Sample paths demonstrating evolutions of the empirical conditional entropy, average distortion, and energy function when Alg. [T] is 
applied to the output of a Bern(p) i.i.d source, with p = 0.2 (n = 2 X 10'', k = 9, a = 4, r = lOn, and /3t = (I/7) , where 7 = 0.7 ). 

varies as 0.1 : 0.4 : 3.3. 

B. Discussion on the choice of different parameters 

1 ) Context length k: As stated in Theorem [l] in order to get to the optimal performance, k should increase as 
o(logn). For getting good performance, it is crucial to choose k appropriately. Note that the order k determines 
the order of the count matrix m which is used to measure the complexity of the quantized sequence. Choosing k 
to be too big or too small compared to the length of our sequence are both problematic, if k is too small, then 
the count matrix m will not capture all useful structures existing in the sequence. These structures potentially help 
the universal lossy coder to describe the sequence with fewer number of bits. On the other hand, if k is too large 
compared to the block length n, then Hk{y'"') gives a unreliable underestimate of the complexity of the sequence. 
One reason is that in this the counts are mainly or some small integer. 

In order to demonstrate the effect of the context length k on the Algorithm performance, consider applying Alg.[T] 
to a binary symmetric Markov source with transition probability p = 0.2. Fig. |9] shows the average performance 
over / = 50 iterations. The performance measure used in this figure is the average energy of the compressed 
sequences, i.e., £{x'^) = Hk{x^) + a(i„(a;", i"), for different values of a. It can be observed that k = b and = 6 
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Fig. 4. Sample paths demonstrating evolutions of the empirical conditional entropy, average distortion, and energy function when Alg. [T] is 
appUed to the output of a BSMS source, with p = 0.2 (n = 10'', fc = 7, a = 4, r = lOn, and /St = (I/7) ^'^"1 , where 7 = 0.8 ). 




Fig. 5. The 6' order context used in coding of 2-D images 

have almost similar performances, but increasing k to 7 improves the performance noticeably. In each iteration a 
BSMS(p) sequence of length n = 10'* is generated, and is coded by Alg. [T] for fc = 5, fc = 6 and fc = 7. In all 
cases the cooling schedule is fixed to (3t = (1/7)^*/"!, where 7 = 0.75. For each value of fc, and each simulated 
sequence, the algorithm starts from a = 4 and step by step decreases it to 2. 

2) Block length n: Fig. [TO] shows the effect of increasing the block length on the minimized cost function for 
a fixed fc. The source is again BSMS(p) with p ~ 0.2. The other parameters are fc = 7, a = 4 : —0.5 : 2, 
= (1/7)'^*^"^ with 7 = 0.75, and r = lOn. Here, each point corresponds to the average performance over 
/ = 50 iterations. It can be observed that somewhat counter-intuitively, increasing the block length increases 
the minimized cost. The reason is that as mentioned earlier, the real cost is not Hk{x^) + Q!(i„(a;", x"), but is 
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Fig. 6. Original image with empirical conditional entropy of 0.1025 

^Lz(i")/"-+a(i„(a;", x"). This increase in the cost is an indication of the fact that Hk{-) underestimates l]_,i{x'^) / n. 
As n increases the gap between Hk{ ) and ^?Lz(i")/"- closes, and the estimate becomes more accurate. Note that 
while increasing n from lO'' increases the cost noticeably, but from n = 2 x lO"* to n = 5 x lO"' the increase 
is almost negligible which somehow suggests that increasing n further will not improve the performance, and for 
achieving better performance we need to increase k as well as n. 

3) Cooling schedule {Pt}-' In all of our simulations the cooling schedule follows the generic form of Pt = 
^0 (1/7) for some 7 < 1, but usually > 0.7. This is a common schedule used in simulated annealing literature. 
By this scheme, the running time is divided into intervals of length n, and the temperature remains constant during 
each interval, and decreases by a factor 7 in the next interval. Hence larger values of 7 correspond to slower cooling 
procedures. The specific values of 7 and /?o can be chosen based on the signal to be coded. 

4) Number of iterations r: Although we have not yet derived a convergence rate for Alg. [T] from our simulations 
results, we suspect that for natural signals not having strange characteristics, r = mn iterations, where m = o(log n), 
is enough for deriving a reasonable approximation of the solution to the exhaustive search algorithm. However, we 
do not expect similar result to hold for all signals, and there might exist sequences such that the convergence rate 
of simulated annealing is too slow for them. 

C. Sliding-block coding 

Consider applying Alg.|3]of Section IVlto the output of a BSMS with q = 0.2. Fig. [TTI shows the algorithm output 
along with Shannon lower bound and lower/upper bounds on R{D) from Q- Here the parameters are: n = 5 x 10^, 
fc = 8, SB window length of kf = 11 and = Kfa\og{t + 1). 

In all of the presented simulation results, it is the empirical conditional entropy of the final reconstruction block 
that we are comparing to the rate-distortion curve. It should be noted that, though this difference vanishes as 
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(a) Reconstruction image witli empirical conditional entropy of 0.0600 
and average distortion of 0.0337 per pixel (a = 0.1). 




(b) Reconstruction image with empirical conditional entropy of 0.0824 
and average distortion of 0.0034 per pixel (a = 3.3). 

Fig. 7. Applying Alg.[T]to a 2-D binary image (ft = (I/t)^*/"!, where 7 = 0.99, r = 507V) 

the block size grows, for finite values of n there would be an extra (model) cost for losslessly describing the 
reconstruction block to the decoder. 

VII. Application: Optimal denoising via MCMC-based lossy coding 

Consider the problem of denoising a stationary ergodic source X with unknown distribution corrupted by additive 
white noise V. Compression-based denoising algorithms have been proposed before by a number of researchers, cf. 
1431 . II44I . B31 and references therein. The idea of using a universal lossy compressor for denoising was proposed 
in ll44ll . and then refined in B31 to result in a universal denoising algorithm. In this section, we show how our 
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Fig. 8. Size of the compressed image in terms of the entropy of the original image (in precentage) versus distortion (a = 0.1 : 0.4 : 3.3, 

/3t = (l/7)r*/"l, where 7 = 0.99, r = 50N). 



new MCMC -based lossy encoder enables the denoising algorithm proposed in ||45l to lead to an implementable 
universal denoiser. 

In ||45l , it is shown how a universally optimal lossy coder tuned to the right distortion measure and distortion 
level combined with some simple "post-processing" results in a universally optimal denoiser. In what follows we 
first briefly go over this compression-based denoiser described in ^45l . and then show how our lossy coder can be 
embedded in for performing the lossy compression part. 

Throughout this section we assume that the source, noise, and reconstruction alphabets are Ai-aiy alphabet 
A = {0, 1, . . . , M — 1}, and the noise is additive modulo-M and Py(a) > for any a E A, i.e. Zi = Xi + Vi. 

As mentioned earlier, in the denoising scheme outlined in 1451 . first the denoiser lossily compresses' the noisy 
signal appropriately, and partly removes the additive noise. Consider a sequence of good lossy coders characterized 
by encoder/decoder pairs (E„, D„) of block length n working at distortion level H{V) under the difference distortion 
measure defined as 

y) = log ——^ -. (59) 

Pv[x - y) 

By good, it is meant that for any stationary ergodic source X, as n grows, the rate distortion performance of the 
sequence of codes converges to a point on the rate-distortion curve. The next step is a simple "post-processing" 
as follows. For a fixed m, define the following count vector over the noisy signal and its quantized version 
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Fig. 9. Effect of context length on the algorithm performance in coding a BSMS(p) for different values of q (p = 0.2, q = 4 : —0.5 : 2, 
/3t = (1/7)^*/"!, where 7 = 0.75, and r = lOn). 



y" = D„(E„(Z")), 

-|{1 <i<n: (Zl+lYi) - (z2"+i,y)}|. (60) 
n 

After constructing these count vectors, the denoiser output is generated through the "post-processing" or "de- 
randomization" process as follows 

X, = argmin V Q2'»+i[Z",F"](z2'^+i,?/)d(x,y), (61) 

where d{-,-) is the original loss function under which the performance of the denoiser is to be measured. The 
described denoiser is shown to be universally optimal ll45l . and the basic theoretical justification of this is that the 
rate-distortion function of the noisy signal Z under the difference distortion measure satisfies the Shannon lower 
bound with equality, and it is proved in B31 that for such sources ^ for a fixed fc, the fc-th order empirical joint 

^In fact it is shown in [45 1 that this is true for a large class of sources including i.i.d sources and those satisfying the Shannon lower bound 
with equality. 
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Fig. 10. Effect of block length n on the algorithm performance in coding a BSMS(p) for different values of a (p = 0.2, fc = 7, o = 4 : —0.5 : 2, 
/3t = (1/7) r*/"! _ where 7 = 0.75, and r = lOn). 

distribution between the source and reconstructed blocks defined as 

gfe[^„^y„](^fc^yfc) A (62) 

^|{1 <t<n: {X]+^-\Yi+^-') = 
resulting from a sequence of good codes converge to Px'' .y>' in 

distribution, i.e. 4> Pxky"^ where 

Px'^.Y'' is the unique joint distribution that achieves the fc-th order rate-distortion function of the source. In the 
case of quantizing the noisy signal under the distortion measure defined in ( |59] l, at level H{V), Pxk yk is the fc-th 
order joint distribution between the source and noisy signal. Hence, the count vector Q^™+^[Z",y"](z^"'+^,y) 
defined in (|60] | asymptotically converges to Pxi\Z'^ which is what the optimal denoiser would base its decision on. 
After estimating PxilZ"- the post-processing step is just making the optimal Bayesian decision at each position. 

The main ingredient of the described denoiser is a universal lossy compressor. Note that the MCMC-based 
lossy compressor described in Section V is applicable to any distortion measure. The main problem is choosing 
the parameter a corresponding to the distortion level of interest. To find the right slope, we run the quantization 
MCMC-based part of the algorithm independently from two different initial points ai and a2- After convergence of 
the two runs we compute the average distortion between the noisy signal and its quantized versions. Then assuming 
a linear approximation, we find the value of a that would have resulted in the desired distortion, and then run the 
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Fig. 11. Comparing the algorithm rate-distortion performance with the Shannon lower bound for a BSMS with q = 0.2. Algorithm parameters: 
n = 5 X 10'', k = S, kf = 5 (Kf = 2"), /3t = Kfalog(t + 1), and slope values a = 5.25, 5, 4.75 and 4.5. 

algorithm again from this starting point, and again computed the average distortion, and then find a better estimate 
of a from the observations so far After a few repetitions of this process, we have a reasonable estimate of the 
desired a. Note that for finding a it is not necessary to work with the whole noisy signal, and one can consider 
only a long enough section of data first, and find a from it, and then run the MCMC -based denoising algorithm 
on the whole noisy signal with the estimated parameter a. The outlined method for finding a is similar to what is 
done in 1461 for finding appropriate Lagrange multiplier. 

A. Experiments 

In this section we compare the performance of the proposed denoising algorithm against discrete universal 
denoiser, DUDE 1471 . introduced in l48l . DUDE is a practical universal algorithm that asymptotically achieves the 
performance attainable by the best n-block denoiser for any stationary ergodic source. The setting of operation of 
DUDE is more general than what is described in the previous section, and in fact in DUDE the additive white noise 
can be replaced by any known discrete memoryless channel. 

As a first example consider a BSMS with transition probabiUty p. Fig. [14] compares the performance of DUDE 
with the described algorithm. The slope a is chosen such that the expected distortion between the noisy image and 
its quantized version using Alg.[T]is close to the channel probability of error which is (5 = 0.1 in our case. Here we 
picked a = 0.9 for all values of p and did not tune it specifically each time. Though, it can be observed that, even 
without optimizing the MCMC parameters, the two algorithms have similar performances, and in fact for small 
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Fig. 12. The 4**^ order context used by DUDE in 2D image denoising example 

values of p the new algorithm outperforms DUDE. 

In another example, let us consider denoising the binary image shown in Fig. |6] Fig. [15] shows its noisy version 
which is generated by passing the original image through a DMC with error probability of 0.04. Fig. |16(a)| shows the 
reconstructed image generated by DUDE and |16(b)| depicts the reconstructed image using the described algorithm. 
In this experiment the DUDE context structure is set as Fig. [12] The 2D MCMC coder employs the same context as 
the one used in the example of Section IVI-AI which is shown in Fig. [5] and the derandomization block is chosen 
as Fig. [T3] 




Fig. 13. The de-randomization block used in MCMC-based denoising of a 2-D image example 

Discussion: The new proposed approach which is based on MCMC coding plus de -randomization is an alternative 
not only to the DUDE, but also to MCMC-based denoising schemes that have been based on and inspired by the 
Geman brothers' work lITTI . While algorithmically, this approach has much of the flavor of previous MCMC-based 
denoising approaches, ours has the merit of leading to a universal scheme, whereas the previous MCMC-based 
schemes guarantee, at best, convergence to something which is good according to the posterior distribution of the 
original given the noisy data, but as would be induced by the rather arbitrary prior model placed on the data. It is 
clear that here no assumption about the distribution/model of the original data is made. 

VIII. CONCLUSIONS AND FUTURE WORK 

In this paper, a new implementable universal lossy source coding algorithm based on simulated annealing Gibbs 
sampling is proposed, and it is shown that it is capable of getting arbitrarily closely to the rate-distortion curve of 
any stationary ergodic source. For coding a source sequence x", the algorithm starts from some initial reconstruction 
block, and updates one of its coordinates at each iteration. The algorithm can be viewed as a process of systematically 
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Fig. 14. Comparing the denoiser based on MCMC coding plus de-randomization with DUDE and optimal non-universal Bayesian denoiser 
which is implemented via forward-backward dynamic programming. The source is a BSMS(p), and the channel is assumed to be a DMC with 
transition probability 5 = 0.1. The DUDE parameters are: fcietf = bright = 4, and the MCMC coder uses a = 0.9, /3t = 0.5 log i, r = lOn, 
n = le4, k = 7. The de-randomization window length is 2 X 4 + 1 = 9. 




Fig. 15. Noisy image corrupted by a BSC(0.04) 
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(b) MCMC coder + de-randomization reconstruction image with di\f(x^ , x^) = 
0.0128: a = 2, ft = Slogt, r = lOAr, 

Fig. 16. Comparing the performance of MCMC-based denoiser with the performance of DUDE 
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introducing 'noise' into the original source block, but in a biased direction that results in a decrease of its description 
complexity. We further developed the application of this new method to universal denoising. 

In practice, the proposed algorithms [T] and [3] in their present form, are only applicable to the cases where the size 
of the reconstruction alphabet, is small. The reason is twofold: first, for larger alphabet sizes the contexts will 
be too sparse to give a true estimate of the empirical entropy of the reconstruction block, even for small values of k. 
Second, the size of the count matrix m grows exponentially with \X\ which makes storing it for large values of jA"! 
impractical. Despite this fact, there are practical applications where this constraint is satisfied. An example is lossy 
compression of binary images, like the one presented in Section |VI] Another application for lossy compression of 
binary data is shown in 1491 where one needs to compress a stream of and 1 bits with some distortion. 

The convergence rate of the new algorithms and the effect of different parameters on it is a topic for further study. 
As an example, one might wonder how the convergence rate of the algorithm is affected by choosing an initial 
point other than the source output block itself. Although our theoretical results on universal asymptotic optimality 
remain intact for any initial starting point, in practice the choice of the starting point might significantly impact the 
number of iterations required. 

Finally, note that in the non-universal setup, where the optimal achievable rate-distortion tradeoff is known in 
advance, this extra information can be used as a stopping criterion for the algorithm. For example, we can set it to 
stop after reaching optimum performance to within some fixed distance. 

APPENDIX A: Proof of Theorem[2] 

Our proof follows the results presented in ISOl . Throughout this section. An = denotes the state space of 
our Markov chain (MC), P defines a stochastic transition matrix from An to itself, and tt defines a distribution on 
An satisfying ttP = tt. Let N = jA"!" denote the size of the state space, and, for i G {!,..., N}, let pi represent 
the i"^ row of P. 

Definition 1 (Ergodic coefficient): Dobrushin's ergodic coefficient of P , 6{P), is defined to be 

(5(P) = max IIPi-PjIItv- (A-1) 

l<i,j<N 

N 

From the definition, < 5{P) < 1. Moreover, since ||pi — PjUxv = 1 ^ miii(pj;fe,Pjfc)' the ergodic coefficient 

k=l 

can alternatively be defined as 

N 

S{P) = 1- min 'S^mm{pik,Pjk)- (A-2) 

l<i.i<N ^ — ' 
- k=l 

The following theorem states the connection between the ergodic coefficient of a stochastic matrix and its conver- 
gence rate to the stationary distribution. 

Theorem 4 (Convergence rate in terms of Dobrushin's coefficient): Let /i. and u be two probabiUty distributions 
on An- Then 

|l/xP*-z/P*||i < ||/x-i/||i(5(P)* (A-3) 
Corollary 1: By substituting i> = n in (IA-3b . we get ||/xP* — 7r||i < \\fj, — 7rj|i(5(P)*. 
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Thus far, we talked about homogenous MCs with stationary transition matrix. However, in simulated annealing 
we deal with a nonhomogeneous MC. The transition probabilities of a nonhomogeneous MC depend on time and 
vary as time proceeds. Let P^*) denote the transition Matrix of the MC at time t, and for < ni < 712 G IN, define 
p(ni,n2) A pi"^^^^ p(*'. By this definition, if at time ni the distribution of the MC on the state space An is /.t„^, 
at time 112, the distribution evolves to /x^^ ~ /x^j^P'^"!'"^'. The following two definitions characterize the steady 
state behavior of a nonhomogeneous MC. 

Definition 2 (Weak ergodicity): A nonhomogeneous MC is called weakly ergodic if for any distributions /x and 
u over An, and any ni e IN, 

limsup||/.tP("i'"2) -z/P("i'"2)||i ^ 0. (A-4) 

712— > 00 

Definition 3 (Strong ergodicity): A nonhomogeneous Markov chain is called strongly ergodic if there exists a 
distribution over the state space An such that for any distributions fi and ni e M, 

limsup ||//P("i'"^) - 7r||i = 0. (A-5) 

n2^oo 

Theorem 5 (Block criterion for weak ergodicity): A MC is weakly ergodic iff there exists a sequence of integers 
< ni < n2 < . . ., such that 

QO 

^(1 - <S(p(""»'+i))) ^ oo. (A-6) 
1=1 

Theorem 6 (Sufficient condition for strong ergodicity): Let the MC be weakly ergodic. Assume that there exists 
a sequence of probability distributions, {tt^*)}^]^, on An such that Then the MC is strongly 

ergodic, if 

oo 

^||7r« -7r(^+i)||i <C30. (A-7) 



After stating all the required definitions and theorems from IdOII . finally we get back to our main goal which 
was to prove that by the mentioned choice of the {fit} sequence. Algorithm [T| converges to the optimal solution 
asymptotically as block length goes to infinity. Here P(^\ the transition matrix of the MC at the j*'^ iteration, 
depends on jij. Using Theorem |5] first we prove that the MC is weakly ergodic. 

Lemma 1: The ergodic coefficient of p(J"'(J+i)"), for any j > is upper-bounded as follows 

J(^p(j".0 + 1)")) < 2 - g-"(/3jA+e„)^ (A-8) 

where 

A = max 5i, (A-9) 

ie{l,...,n} 



for 



and 



logfn!) 

logn ^L^. (A-10) 
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Proof: Let y" and t/J be two arbitrary sequences in A"". Since the Hamming distance between these two 
sequence is at most n, starting from any sequence y", after at most ?7 steps of the Gibbs sampler, it is possible to 
get to any other sequence 1)2 ■ On the other hand at each step the transition probabilities of jumping from one state 
to a neighboring state, i.e., 

P [u ""'+1^- „ ^ exp(-A£:(u-i6<+i))' ^^^^^ 

66 A' 

can be upper bounded as follows. Dividing both the numerator and denominator of jA-l II) by exp{— Pt£niin,i ^ , 

where fmin,j(u'"\ m", 1) = min£{u'~^bu'^^), we get 

hex 



p(*)f,/-i/,„" \v^-'ov- exp(-A(gK ^a<_,i)-g.nin,.K \<+i))) 



(A-12) 



Therefore, 



> ^. (A-13) 

- n\X\ 

jnin p(^"'(^+lW(y5',y2") > — FT = ^ ^ , (A-14) 

_ jn-\-n—l 

where /3, = i ^ ft. 

t—jn 

Using the alternative definition of the ergodic coefficient given in ( IA-21) . 

^(pO-",(j+i)")) ^ 1 _ miii(p(^"-(^+i'")(y^\z"),p(^"'(^'+i)")(yJ,z")) 

< 1 - li-l" J_e-"(A-^+^") (A-15) 

= l_e-»(ft^+<^"). (A-16) 



Corollary 2: Let ft = ^"^''^"J,^"'"'' , where Tq"'' = ctiA, for some c > 1, and A is defined in ( IA-9l l, in Algorithm 
[U Then the generated MC is weakly ergodic. 

Proof: For proving weak ergodicity, we use the block criterion stated in Theorem |5] Let nj = jn, and note 
that ft = '°s(J+i) jjj jjjjg case. Observe that 



^(1 - 5(pK'".+i))) = ^(1 _ ^(p(i".(j+i)"))) 
j=o i=i 



j=0 



J=0 



e— .5:-i,=oo. (A-19) 

j=i 
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This yields the weak ergodicity of the MC defined by the simulated annealing and Gibbs sampler. ■ 
Now we are ready to prove the result stated in Theorem |2] Using Theorem |6l we prove that the MC is in fact 

strongly ergodic and the eventual steady state distribution of the MC as the number of iterations converge to infinity 

is a uniform distribution over the sequences that minimize the energy function. 

At each time t, the distribution defined as 7r(*'(y") — e~'^'^'^"'/^/3t satisfies 7r(*^P(*^ ~ tt^*\ Therefore, if we 

prove that 

oo 

Elk*'^ -7r(*+i)||i < oo, (A-20) 
t=i 

by Theorem |6] the MC is also strongly ergodic. But it is easy to show that tt^*^ converges to a uniforms distribution 
over the set of sequences that minimize the energy function, i.e., 

lim 7r(*)(2/") = <^ , y ^ , 

where U = {y" : f (y") = min 

Hence, if we let X" denote the output of Algorithm [T] after t iterations, then 

lim = min £(?/"), (A-22) 

t-S-oo y-^^X-^ 

which combined with Theorem [T] yields the desired result. 

In order to prove (IA-20b . we prove that 7r'*'(?/") is increasing on and eventually decreasing on , hence 
there exists to such that for any t\ > to, 

t=to y"GKt=to y"G^"\'H*=*o 

< ^(1)- (A-23) 

Since the right hand side of ( IA-231 ) of does not depend on ti, X!t^o IK''*'' ~ < oo- Finally, in order to 

prove that 7r(*)(?/") is increasing for y" e H, note that 

7r(*)(2/"): 



J2 e-'^'^f-'") 

z"£X" 

1 



(A-24) 



Since for y" G V. and any 2" e A"", £(2") - £{y"-) > 0, if ii < ^2, 
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and hence 7r(*i)(y") < 7r(*2)(2/"). On the other hand, if y" ^ U, then 

(*)r,/M = - 



1 



(A-25) 



For large (3 the denominator of (IA-25I I is dominated by the second term which is increasing in pt and therefore 
7r(t)(y") vvill be decreasing in t. This concludes the proof. 

APPENDIX B: Proof of Theorem[3] 

First we need to prove that a result similar to Theorem [T] holds for SB codes. I.e., we need to prove that for 
given sequences {A: !"''}„ and {fc„}„ such that lim A;!"' = 00 and fc„ — o(logrt), finding a sequence of SB codes 
according to 

Z^"' =argmin£(/4"'), (B-1) 

f I 

where E{f^i^ ) is defined in gSll and xf^ = 2^^t^ +\ results in a sequence of asymptotically optimal codes for 
any stationary ergodic source X at slope a. In other words, 

1 



lim 

n— )-oo 



-4z(^") + ad„(X",X") 



= mill [R{D, X) + aD] , a.s. (B-2) 



_D>0 



where X" /^/" ]. After proving this, the rest of the proof follows from the proof of Theorem |2] by just 

redefining 6i as 



max ■ 



For establishing the equality stated in (IB-2l i. similar to the proof of Theorem [T] we prove consistent lower and 
upper bounds which in the limit yield the desired result. The lower bound. 



lim inf E 



-4z(^") + ad(X",l") 
n 



> mm [R{D, X) + aD] , (B-3) 



follows from part (1) of Theorem 5 in f32l. For proving the upper bound, we split the cost into two terms, as done 
in the equation (fTsl l. The convergence to zero of the first term again follows from a similar argument. The only 
difference is in upper bounding the second term. 

Since, asymptotically, for any stationary ergodic process X, SB codes have the same rate-distortion performance 
as block codes, for a point {R{D,X.), D) on the rate-distortion curve of the source, and any e > 0, there exits a 

satisfies 

1) H{X)<R{D,X), 

2) Ed{Xo,XQ) <D + e. 



SB code /2'^/+i of some order k% such that coding the process X by this SB code results in a process X which 
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On the other hand, for a fixed n, £{f^f) is monotonically decreasing in Kf. Therefore, for any process X and 

(n) 

any S > 0, there exists ns such that for n> ng and > 1%^^ 

< R{D, X) + a{D + e) + (5, w.p. 1. (B-4) 
Combining ( IB-31 ) and ( IB-41 ). plus the arbitrariness of e, 5 and D yield the desired result. 
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